Job management with SLURM¶
You should not run your compute code directly on the terminal you find when you log in.
In order to submit a job on the DGX, you need to describe the resources you need (partition, MIGs, CPUs) to the task manager Slurm. The task manager will launch the job as soon as the requested resources are available.
You can run up to 4 jobs at a given time. All subsequent requests will be put on hold until one of your previous jobs is completed.
There are two ways to run a compute code on the DGX:
- using an interactive Slurm job: this will open an interactive session where you can execute your code. This method is well-suited for light tests and environment configuration (especially for GPU-accelerated codes). See the section Interactive jobs.
- using a Slurm script: this will submit your script to the scheduler, which will run it when the resources are available. This method is well-suited for "production" runs.
Slurm is configured with a "fairshare" policy among the users, which means that the more resources you have requested in the past days, the lower your priority will be when the task manager has several jobs to handle at the same time.
You can check your usage and fairshare information at any time with the command sshare -l.
You can also check the priority information of the pending jobs with sprio.
In addition to this page, which documents Slurm commands in the context of the DGX, you can check the Slurm workload manager documentation.
Slurm script¶
Most of the time, you will run your code through a Slurm script. This script has the following functions:
- specify the resources you need for your code: partition, type of MIG, how many tasks and CPUs per task;
- specify other parameters for your job (project which your job belongs to, output files, mail information on your job status, job name, etc.);
- setup the batch environment (load modules, set environment variables);
- run the code!
Running the code will depend on your executable. Parallel codes may have to use srun or have specific environment variables set.
Slurm directives¶
You describe the resources you need using sbatch directives (script lines beginning with #SBATCH). These specifications can either be passed directly as options to the sbatch command or listed in the submission script. Using a script is the best solution if you want to submit the job several times, or to submit several similar jobs.
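For instance, the same options can be given either on the command line at submission time or as directives inside the script (the option values below are just placeholders):
# On the command line at submission time:
$ sbatch --job-name=test --time=01:00:00 job.batch
# Or as directives inside job.batch:
#!/bin/bash
#SBATCH --job-name=test
#SBATCH --time=01:00:00
Options given on the command line take precedence over the corresponding #SBATCH directives in the script.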
Mandatory slurm directives on the DGX¶
partition¶
Mandatory: there is no default partition, so you must choose a partition according to the resources you need from the list of available partitions. Then specify the Slurm partition your job will be assigned to:
#SBATCH --partition=<PartitionName>
gres¶
Mandatory: specify which type of MIG you want, using generic resources (gres):
#SBATCH --gres=gpu:<Type>:<Quantity>
with <Type>:<Quantity> either 1g.10gb:1 for partition prod10, 2g.20gb:1 for prod20, 3g.40gb:1 for prod40, or A100.80gb:1 for prod80.
ntasks¶
Mandatory: specify the number of tasks (MPI processes):
#SBATCH --ntasks=<ntasks>
If you don't need to run commands in parallel, just ask for one task (#SBATCH --ntasks=1).
cpus-per-task¶
Mandatory: specify the number of threads per process (e.g. OpenMP threads per MPI process):
#SBATCH --cpus-per-task=<ntpt>
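Putting the four mandatory directives together, a minimal job header could look like the following sketch (the prod10 partition and the 1g.10gb MIG are only one possible choice):
#!/bin/bash
## Minimal set of mandatory directives (values chosen for illustration)
#SBATCH --partition=prod10       # partition matching the needed resources
#SBATCH --gres=gpu:1g.10gb:1     # one 1g.10gb MIG
#SBATCH --ntasks=1               # a single (non-MPI) task
#SBATCH --cpus-per-task=4        # 4 CPUs for this task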
Other SBATCH directives¶
error¶
Define the error output (stderr) for your job:
#SBATCH --error=</path/to/errorJob>
By default both standard output and standard error are directed to the same file.
export¶
Export user environment variables. By default, all user environment variables will be loaded (--export=ALL). To avoid dependencies and inconsistencies between the submission environment and the batch execution environment, disabling this functionality is highly recommended.
To avoid exporting the environment variables present at job submission time to the job's environment:
#SBATCH --export=NONE
To select explicitly exported variables from the caller's environment to the job environment:
#SBATCH --export=VAR1,VAR2
You can also assign values to these exported variables, for example:
#SBATCH --export=VAR1=10,VAR2=18
job-name¶
Define the job's name:
#SBATCH --job-name=<jobName>
mail-type¶
To be notified by mail (at the address defined by mail-user) when a step has been reached:
#SBATCH --mail-type=ALL
The arguments for the --mail-type option are:
- BEGIN: send an email when the job starts
- END: send an email when the job stops
- FAIL: send an email if the job fails
- ALL: equivalent to BEGIN, END, FAIL.
mail-user¶
Set an email address (useful to be notified according to the option chosen with mail-type):
#SBATCH --mail-user=firstname.lastname@mywebserver.com
If used, this must not be empty.
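Slurm also accepts a comma-separated list of values for --mail-type. For example, to be notified only when the job ends or fails (the address is a placeholder):
#SBATCH --mail-type=END,FAIL
#SBATCH --mail-user=firstname.lastname@mywebserver.com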
output¶
Define the standard output (stdout) for your job:
#SBATCH --output=</path/to/outputJob>
The default is --output=slurm-%j.out (the %j in the filename will be replaced by the job ID automatically at file creation).
If you need to direct the stdout to a specific directory, you must first create the directory, say logs, and then set the option as --output=logs/slurm-%j.out.
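For example, to separate stdout and stderr in a logs directory (a sketch; the directory must exist before the job starts):
# Create the directory once, before submitting:
$ mkdir -p logs
# Then, in the batch script:
#SBATCH --output=logs/slurm-%j.out
#SBATCH --error=logs/slurm-%j.err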
propagate¶
By default, all resource limits (those reported by the ulimit command: stack size, open files, number of processes, ...) are propagated (--propagate=ALL). To avoid propagating the interactive limits and overriding the batch resource limits, disabling this functionality is encouraged:
#SBATCH --propagate=NONE
time¶
Specify the walltime for your job (within the Max Walltime of the partition). If your job is still running after the walltime duration, it will be killed:
#SBATCH --time=<hh:mm:ss>
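For example, to request a walltime of two hours:
#SBATCH --time=02:00:00
Slurm also understands a days-hours:minutes:seconds form, e.g. --time=1-12:00:00 for a day and a half.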
Submit and monitor jobs¶
Example batch file templates to execute a main.py file¶
A fairly general template of a script job.batch which runs a main.py file, including all mandatory directives (partition, gres, ntasks and cpus-per-task):
#!/bin/bash
#
#SBATCH --job-name=job
#SBATCH --output /path/to/slurm-%j.out
#SBATCH --error /path/to/slurm-%j.err
## For partition: either prod10, prod20, prod40 or prod80
#SBATCH --partition=prod10
## For gres: either 1g.10gb:1 for prod10, 2g.20gb:1 for prod20, 3g.40gb:1 for prod40 or A100.80gb:1 for prod80.
#SBATCH --gres=gpu:1g.10gb:1
## For ntasks and cpus: the total requested CPUs (ntasks * cpus-per-task) must be in [1 : 4 * nMIG], with nMIG = VRAM / 10 (i.e. nMIG = 1, 2, 4, 8 for 1g.10gb, 2g.20gb, 3g.40gb, A100.80gb).
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=4
## Perform run
python3 /path/to/main.py
Another one using a virtual environment, a logslurms directory (for the output and error files) and a working_directory (containing the main.py) in the user's home:
#!/bin/bash
#
#SBATCH --job-name=job
#SBATCH --output=~/logslurms/slurm-%j.out
#SBATCH --error=~/logslurms/slurm-%j.err
## For partition: either prod10, prod20, prod40 or prod80
#SBATCH --partition=prod10
## For gres: either 1g.10gb:1 for prod10, 2g.20gb:1 for prod20, 3g.40gb:1 for prod40 or A100.80gb:1 for prod80.
#SBATCH --gres=gpu:1g.10gb:1
## For ntasks and cpus: the total requested CPUs (ntasks * cpus-per-task) must be in [1 : 4 * nMIG], with nMIG = VRAM / 10 (i.e. nMIG = 1, 2, 4, 8 for 1g.10gb, 2g.20gb, 3g.40gb, A100.80gb).
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=4
## Virtual environment
source ~/env/bin/activate
## Perform run
CUDA_VISIBLE_DEVICES=1 time python ~/working_directory/main.py
In both examples, standard output (stdout) will be in the slurm-%j.out file (the %j will be replaced by the job ID automatically) and standard error (stderr) will be in the slurm-%j.err file.
submit job¶
You need to submit your script job.batch with:
$ sbatch /path/to/job.batch
Submitted batch job 29509
which responds with the JobID assigned to the job; here, the JobID is 29509. The JobID is a unique identifier that is used by many Slurm commands.
monitor job¶
The squeue command shows the list of jobs:
$ squeue
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
29509 prod10 job username R 0:02 1 dgxa100
You can change the default format through the SQUEUE_FORMAT variable, for example by adding the following to your .bash_profile:
export SQUEUE_FORMAT="%.18i %.14P %.8j %.10u %.2t %.10M %20b %R"
This replaces the NODES information (always 1, since there is only the DGX) with the MIG required by the job (column TRES_PER_NODE):
JOBID PARTITION NAME USER ST TIME TRES_PER_NODE NODELIST(REASON)
For more squeue format options, see the squeue documentation.
cancel job¶
The scancel command cancels jobs.
To cancel the job with JobID 29509 (obtained at submission or through squeue), you would use:
$ scancel 29509
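scancel also accepts filters. For instance, to cancel all of your own pending and running jobs at once:
$ scancel --user=$USER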
interactive jobs¶
For an interactive session:
- the partition must be interactive10;
- the reserved MIG must be a single 1g.10gb;
- the total CPUs requested (ntasks * cpus-per-task) must not exceed 4;
- example:
$ srun --partition=interactive10 --gres=gpu:1g.10gb:1 --ntasks=1 --cpus-per-task=4 --pty /bin/bash
$ squeue
JOBID PARTITION NAME USER ST TIME TRES_PER_NODE CPUS MIN_MEMORY NODELIST(REASON)
462 interactive10 bash username R 0:05 gres:gpu:1g.10gb:1 4 4000M dgxa100
$ nvidia-smi
Thu Jul 13 13:01:11 2023
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 525.125.06 Driver Version: 525.125.06 CUDA Version: 12.0 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|===============================+======================+======================|
| 0 NVIDIA A100-SXM... Off | 00000000:01:00.0 Off | On |
| N/A 48C P0 57W / 275W | 45MiB / 81920MiB | N/A Default |
| | | Enabled |
+-------------------------------+----------------------+----------------------+
| 1 NVIDIA A100-SXM... Off | 00000000:47:00.0 Off | On |
| N/A 48C P0 65W / 275W | 45MiB / 81920MiB | N/A Default |
| | | Enabled |
+-------------------------------+----------------------+----------------------+
| 2 NVIDIA A100-SXM... Off | 00000000:81:00.0 Off | On |
| N/A 48C P0 58W / 275W | 45MiB / 81920MiB | N/A Default |
| | | Enabled |
+-------------------------------+----------------------+----------------------+
| 3 NVIDIA DGX Display Off | 00000000:C1:00.0 Off | N/A |
| 34% 44C P8 N/A / 50W | 1MiB / 4096MiB | 0% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
| 4 NVIDIA A100-SXM... Off | 00000000:C2:00.0 Off | On |
| N/A 47C P0 56W / 275W | 48MiB / 81920MiB | N/A Default |
| | | Enabled |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| MIG devices: |
+------------------+----------------------+-----------+-----------------------+
| GPU GI CI MIG | Memory-Usage | Vol| Shared |
| ID ID Dev | BAR1-Usage | SM Unc| CE ENC DEC OFA JPG|
| | | ECC| |
|==================+======================+===========+=======================|
| 0 7 0 0 | 6MiB / 9728MiB | 14 0 | 1 0 0 0 0 |
| | 0MiB / 16383MiB | | |
+------------------+----------------------+-----------+-----------------------+
+-----------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=============================================================================|
| No running processes found |
+-----------------------------------------------------------------------------+
To close the session, use the exit command.
job arrays¶
Job arrays are only supported for batch jobs, and the array index values are specified using the --array or -a option of the sbatch command. The option argument can be specific array index values, a range of index values, or a range with a step size, as shown in the examples below. Jobs which are part of a job array will have the environment variable SLURM_ARRAY_TASK_ID set to their array index value.
# Submit a job array with index values between 0 and 31
$ sbatch --array=0-31 job
# Submit a job array with index values of 1, 3, 5 and 7
$ sbatch --array=1,3,5,7 job
# Submit a job array with index values between 1 and 7
# with a step size of 2 (i.e. 1, 3, 5 and 7)
$ sbatch --array=1-7:2 job
The sub-jobs should not depend on each other: Slurm may start them in any order, possibly at the same time.
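As a sketch, a job array script can use SLURM_ARRAY_TASK_ID to differentiate the sub-jobs (the paths and the way the index is passed to main.py are assumptions for illustration):
#!/bin/bash
#SBATCH --job-name=array_job
#SBATCH --output=slurm-%A_%a.out    # %A = array job ID, %a = array index
#SBATCH --partition=prod10
#SBATCH --gres=gpu:1g.10gb:1
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=4
## Each sub-job sees its own index in SLURM_ARRAY_TASK_ID
python3 /path/to/main.py --index "${SLURM_ARRAY_TASK_ID}"
Such a script would be submitted, for example, with sbatch --array=0-31 job.batch.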
chain jobs¶
If you want to submit a job which must be executed after another job, you can use Slurm's job dependency ("chain") mechanism.
$ sbatch slurm_script1.sh
Submitted batch job 74698
$ squeue
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
74698 ******* ******* username PD 0:00 * *******
$ sbatch --dependency=afterok:74698 slurm_script2.sh
Submitted batch job 74699
$ sbatch --dependency=afterok:74698:74699 slurm_script3.sh
Submitted batch job 74700
Note that if one of the jobs in the sequence fails, the following jobs remain pending by default with the reason "DependencyNeverSatisfied" and can never be executed. You must then delete them using the scancel command.
If you want these jobs to be automatically cancelled on failure, specify the --kill-on-invalid-dep=yes option when submitting them.
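For example, resubmitting the second job of the chain above so that it is cancelled automatically if its dependency fails:
$ sbatch --dependency=afterok:74698 --kill-on-invalid-dep=yes slurm_script2.sh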
Here are the common chaining rules:
- after:<jobID> = job can start once job <jobID> has started execution
- afterany:<jobID> = job can start once job <jobID> has terminated
- afterok:<jobID> = job can start once job <jobID> has terminated successfully
- afternotok:<jobID> = job can start once job <jobID> has terminated upon failure
- singleton = job can start once any previous job with identical name and user has terminated
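For instance, singleton can be used to ensure that only one job with a given name runs at a time for your user (the job name nightly_run is just an illustration):
$ sbatch --dependency=singleton --job-name=nightly_run job.batch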
Accounting¶
Use the command sacct to get info on your finished jobs, and sacct -j JobID for a specific job with ID JobID.
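sacct also accepts a --format option to choose the reported fields. For example, for the job submitted earlier (the field selection is just an illustration):
$ sacct -j 29509 --format=JobID,JobName,Partition,State,Elapsed,MaxRSS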